Harnessing the Hidden Potential of Databases

Exploring the Why and How of Database Utilization in Research and Learning

November 2, 2023

Slides URL

The slides contain links, and so here is a link to this exact slide deck

Talk Structure

  • Background: How did I get here?
    • Introduction
    • Past research
  • What are databases?
    • Basics
    • Compare and contrast to spreadsheets
  • Installation and Use
    • Options: from simple to complex
    • Interfaces
  • Examples
  • Summary

Background: How did I get here?

A Timeline

Grad Research

  • My graduate research centered around refining our estimates of the mass distribution of the near-Earth asteroid population
  • Too small to be detected by most means, my detection screen was the shadowed portion of the moon
  • Involved many nights of observations
    • As commonly happens, the amount of data snuck up on me

Tracking Observations

Willamette Research

  • Cloudy skies makes lunar observations impractical
  • What if Earth’s skies became the “detection screen” instead?
    • Any part of the sky being clear could allow for observations
    • Need an automated recording system
  • Detection analysis done onboard with a microcomputer
  • The data load still crept up on me! Despite more efforts!

Currently

  • Upon joining the SCIS faculty, I took on all the database and data management courses
    • I’ve taught the undergrad DATA 351 course twice
    • I’ve taught the graduate DATA 503 course three times, soon to be 5 this spring
  • I continue to teach the introductory programming courses in Python, and run the Section Leader program
  • I teach courses that lie at the intersection of the natural sciences and data science, currently Data in the Cosmos

What is a Database?

A Family

  • There are actually a variety of types of databases, each with their tradeoffs
    • Relational databases
    • Document databases
    • Graph databases
  • For those coming from familiarity with spreadsheets, relational databases will probably feel the more intuitive
  • Relational databases also happen to still be incredibly popular in industry

Why Relational Tables?

  • Suppose you wanted to keep track of your friends’ birthdays
    • This could easily be accomplished in a single table
First Name Last Name Birthday
Frank Stein 4/2/2000
Tessa Loch 8/23/2003
Bobby Wolf 12/14/2005

Complicating Things

  • What if you also wanted to track what classes each friend was taking?
  • Could imagine each friend taking 2-3 classes, some of which overlap
  • This complicates things more than might be expected, since to still store this in a single simple table now requires data duplication

Classy Friends

First Name Last Name Birthday Class Class Day Class Time
Frank Stein 4/2/2000 CS151 MWF 1:00pm
Frank Stein 4/2/2000 MATH256 MWF 9:00am
Tessa Loch 8/23/2003 CS151 MWF 1:00pm
Tessa Loch 8/23/2003 IDS236 TTh 1:00pm
Tessa Loch 8/23/2003 HIST123 MWF 12:00pm
Bobby Wolf 12/14/2005 IDS236 TTh 1:00pm
Bobby Wolf 12/14/2005 MATH256 MWF 9:00am

Data Duplication = BAD

Relational Tables

  • One solution to this issue is realizing that we are trying to actually keep track of two things: friends and classes.
  • So we break things up into two tables, and then create relationships between them
  • This is the core of what occurs in a relational database!
First Name Last Name Birthday
Frank Stein 4/2/2000
Tessa Loch 8/23/2003
Bobby Wolf 12/14/2005
Class Day Time
CS151 MWF 1:00pm
MATH256 MWF 9:00am
IDS236 TTh 1:00pm
HIST123 MWF 12:00pm
PHYS221 MWF 10:00pm

Viewing Relational Databases

  • In general, you’d probably use a third table to represent all the linkages
  • Unique id keys are used to connect the different tables


Relational Database Schema

Differences from Spreadsheets

  • Databases have a much more rigid structure
    • You define up front what columns you want, and the data type associated with that column
    • Each row represents a record or observation, and data is entered one row at a time.
  • Calculations are generally done through a query, not stored in cells within the table
    • Eliminates any confusion between the original data and derived values
  • Only the data matters. There is no coloring or other formatting.
    • Visualizations are handled by external tools

Benefits over spreadsheets

  • Data Integrity
    • Columns enforce types, calculations can not be mistaken for data, relationships between tables enforced
    • More effective at eliminating duplicate information, and the possible issues that arise
  • Far better performance for huge numbers of records
  • Database servers are build for multi-user access, facilitating sharing of data and with full user access control and permissions
  • Excellent support from all popular coding languages for accessing database information
  • Separates where the data is stored from where it is analyzed

SQL

  • Pronounced “ESS-QUE-ELL”, or, historically “sequel”, SQL is the dominant language for interacting with relational databases
  • A declarative language, so fairly straightforward to learn
  • Primarily used for managing the database, adding records to the database, or extracting specific information from the database
SELECT column_names
FROM table_name
WHERE condition_is_true
ORDER BY column_name;

Installation and Use

The Spice of Life

  • Relational databases come in a surprisingly large variety of forms and formats
  • Many are free and/or open source
  • All use a core set of standard SQL, and then start to branch from there
    • Extra functions that are made available
    • Different data types available for columns
  • Options can be overwhelming, but we’ll cover a few common cases

Easiest: SQLite

  • SQLite stores each database in a single file
    • This can make sharing or moving around the database very simple, as people are well acquainted with working with files
  • Installation is simple, or even unnecessary, as the tools for creating, managing, and accessing the database are just bundled in a zip file
    • Files available here!
  • Documentation on website is good
  • Many programming languages can interface with SQLite databases
  • Simple nature and easy installation make it a great starting point

Local Server

  • Most other database installations operate on a server model
    • The server operates as the central hub that clients can connect to to interact with the database
    • Nothing special about the server aside from it needing to be powered on in order for the data to be accessible
  • Could install on a personal system or on a lab or other public system if you want easy access from multiple individuals
  • The big players here include PostgreSQL, MySQL, SQL Server, and Oracle
    • I’ve been teaching PostgreSQL in class and been very happy
      • Installation is more involved, but still not overly complex through EDB here or through Postgres.app here for Mac users

Remote Server

  • If a local server install is too intimidating, or you don’t have a good computer to use as a server, there is no shortage of online deployment options as well, though they come with a cost
  • The tech giants have offerings through Azure, AWS, or Google Cloud Platform, but I’ve found them to be overly complicated for most individual or small group needs
  • My favorite recommendation for the past year has been Railway.app, though it is not quite as free now as it once was
    • Digital Ocean also has super simple setups, though it runs a bit more expensive

SQL Client

  • For any server-based database (though they tend to work with SQLite as well), you need a client to connect to, view, and run SQL code against the database
  • The list of possible clients is massive and can vary between operating systems
  • Many are capable of connecting to most of the different database server options
  • My favorite at the moment is Beekeeper Studio (Community Edition)
  • Another popular client that will connect to most anything is DBeaver
  • Most any client will let you browse table, make edits directly in tables, or run SQL to otherwise manage the database

NocoDB

  • While most SQL clients will show a grid view of a table and may allow edits, they aren’t really intended for that to be the main method of interaction
  • If you want to leverage the benefits of a database while still leveraging the comfort of a spreadsheet, some options exist
    • Airtable is a popular option that streamlines working with a database for non-coding folks. Free small plans exist.
    • NocoDB is an open-source offering that I came across recently which aims to mimic much of what has made Airtable so popular
  • Either option can connect to an internal database or to an external, already existing database
  • Both can significantly simplify allowing multiple users to edit the database, or provide read-only views to certain viewers
  • Both also allow the creation of basic forms to streamline entering new information into the database, which otherwise requires some coding knowledge

Examples

Polling

  • For years now I’ve used an SQLite database to record student submissions to my in-class polls
  • SQLite database and the web server that runs the polling form live on a Raspberry Pi microcomputer
  • Later I go back and query the responses to generate my participation scores
  • Over 18k submissions stored

Grades

  • I generate my own grade reports for my classes, and have done so for years
  • Switching to using a database on the backend has made this so much easier
  • Also allows me total control over implementing other programs that can interface with this system
    • Grading rubrics for helping me grade
    • Grade reporting systems
    • Interfacing with submission systems (GitHub Classroom)

Conclusions

Pros/Cons

  • More robust toward data integrity and organization
  • More performant as the amount of data increases
  • Able to be easily connected to from a host of other software or programs
  • Excellent for storage of accruing data for later analysis
  • Do one thing, and do it very well
  • Are incredibly flexible in being able to store, analyze, and visualize data all at once
  • Are more visual to use, requiring no knowledge of a coding language or SQL
  • Generally expected to store mainly numbers or text, and can struggle with other data formats
  • Perfect for quick analysis of relatively small or uncomplicated datasets

Wrapping Up

  • Thanks to all of you for attending!
  • Hopefully I’ve at least put the thought in your head that in certain cases databases can be worth the shift.
    • They have massively impacted the way I operate (for the better) these past several years.
  • Please know that I’m always happy to help advise or assist anyone with database related questions.
// reveal.js plugins // Added plugins